Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing
Authors
Abstract
Similar resources
Disaster Survival Guide in Petascale Computing: An Algorithmic Approach
Jack J. Dongarra, Zizhong Chen, George Bosilca, and Julien Langou. 1.1 FT-MPI: A fault tolerant MPI implementation; 1.1.1 FT-MPI Overview; 1.1.2 FT-MPI: A Fault Tolerant MPI Implementation; 1.1.3 FT-MPI Usage; 1....
MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirem...
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Today, high-performance computing (HPC) systems are larger than ever, which makes fault tolerance a challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...
MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing
MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements...
Fault Tolerance in MPI Programs
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, ...